Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first attempt to support awkward arrays #647

Merged
merged 129 commits into from
Feb 7, 2023
Merged

first attempt to support awkward arrays #647

merged 129 commits into from
Feb 7, 2023

Conversation

giovp
Copy link
Member

@giovp giovp commented Nov 14, 2021

First draft at supporting awkward arrays.
As discussed with @ivirshup this would be useful for Squidpy @hspitzer (discussed also here #609 ) and potentially @Zethson EHR project.

Here's a walkthrough showcasing some basic functionality we could use for sub observation annotations (e.g. spatial coordinates of rna-molecules/segmentations).

Details
import numpy as np
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
from numpy.random import default_rng
from sklearn.datasets import make_blobs
import awkward as ak

adata = sq.datasets.visium_hne_adata()
adata = adata[:10, :].copy()
adata.obsm["spatial"] = (
    adata.obsm["spatial"] - np.std(adata.obsm["spatial"], 0)
) / np.mean(adata.obsm["spatial"], 0)

# generate data
obs_list = []
rng = default_rng(42)
for idx in adata.obs_names.values:
    coord, _ = make_blobs(
        n_samples=rng.integers(5, 15),
        cluster_std=0.02,
        centers=adata[idx].obsm["spatial"],
        random_state=42,
    )
    obs_list.append(coord)

sub_obs = ak.Array(obs_list)

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].set_aspect("equal")
ax[0].scatter(x=adata.obsm["spatial"][:, 0], y=adata.obsm["spatial"][:, 1])
for i in range(adata.shape[0]):
    ax[1].scatter(
        x=sub_obs[i, :, 0],
        y=sub_obs[i, :, 1],
    )
    ax[1].axis("equal")

image

adata.obsm["sub_obs"] = sub_obs  # sub_obs is an awkward array
adata_subset = adata[:5]  # let's subset, it also works for `.copy()`

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].set_aspect("equal")
ax[0].scatter(
    x=adata_subset.obsm["spatial"][:, 0], y=adata_subset.obsm["spatial"][:, 1]
)
for i in range(adata_subset.shape[0]):
    ax[1].scatter(
        x=adata_subset.obsm["sub_obs"][i, :, 0],
        y=adata_subset.obsm["sub_obs"][i, :, 1],
    )
    ax[1].set_aspect("equal")

image

What fails atm is adata.concatenate() because of errors in reindexing of alternate axis
https://github.com/theislab/anndata/blob/286bc7f207863964cb861f17c96ab24fe0cf72ac/anndata/_core/merge.py#L478

in awkward arrays there is no shape attribute so when array.shape[0] is needed we can resort to len(awkward_array) but for array.shape[1] we should (probably) simply skip it and concatenate. In awkward it would be like this:

ak.concatenate([sub_obs, sub_obs])
>>> <Array [[[1.25, 0.833], ... [0.697, 1.22]]] type='20 * var * var * float64'>

TODO

  • wait for Specs for all elements #554 to get merged and work on IO
  • switch to v2 api of awkward. v1 EOL is in four months and bugs are not getting fixed already now
    - [ ] address newly added TODOs in the code
  • Implement outer joins during concatenation. Either implement inner/outer join logic or raise a corresponding warning
    • Raise a warning like: "Outer joins on awkward.Arrays will have different return values in the future. Please see github.com/scverse/anndata/issues#XXX for details, and offer input."
  • AwkwardArrayView with behavior

Tests

  • test that assigning an awkward v1 array fails
  • test the dim_len function
  • tests for concatenating awkward array with None or MissingVal
  • test awkward support in .uns
  • add specific test for more complex awkward arrays
  • add element-wise IO tests (check different types e.g. union, optiontype, ...)
  • Add tests for getting and setting awkward arrays (e.g. adata.obsm["x"] = awk)
  • Add tests for concatenating
    • Proper reindexing during concatenation
    • When a.obsm["x"] is a awkward array and b.obsm["x"] isn't (error?)
    • [ ] What's the right fill value for when join=outer and a.obsm["x"] exists but b.obsm["x"] doesn't
    • Are values being compared correctly when merge="equal" (add more tests for the equality function)

Docs

@codecov
Copy link

codecov bot commented Nov 14, 2021

Codecov Report

Merging #647 (e524389) into master (283b0c1) will increase coverage by 0.08%.
The diff coverage is 88.54%.

@@            Coverage Diff             @@
##           master     #647      +/-   ##
==========================================
+ Coverage   83.12%   83.21%   +0.08%     
==========================================
  Files          34       34              
  Lines        4416     4503      +87     
==========================================
+ Hits         3671     3747      +76     
- Misses        745      756      +11     
Impacted Files Coverage Δ
anndata/compat/__init__.py 80.97% <37.50%> (-2.08%) ⬇️
anndata/_core/views.py 87.90% <75.00%> (-0.99%) ⬇️
anndata/utils.py 83.33% <80.00%> (-0.60%) ⬇️
anndata/_core/aligned_mapping.py 94.02% <100.00%> (+0.09%) ⬆️
anndata/_core/anndata.py 83.41% <100.00%> (ø)
anndata/_core/index.py 92.85% <100.00%> (+0.35%) ⬆️
anndata/_core/merge.py 93.76% <100.00%> (+0.31%) ⬆️
anndata/_io/specs/methods.py 84.53% <100.00%> (+0.75%) ⬆️
anndata/tests/helpers.py 95.63% <100.00%> (+0.19%) ⬆️

@Zethson
Copy link
Member

Zethson commented Nov 14, 2021

Really cool @giovp and everyone involved. I can already see a couple of use-cases for ehrapy and am eager to see this move forward. Do you expect the support for akward array to also require modifications for all/many of Scanpys algorithms is this mostly a drop-in replacement?

CC @Imipenem

@giovp
Copy link
Member Author

giovp commented Nov 14, 2021

Do you expect the support for akward array to also require modifications for all/many of Scanpys algorithms is this mostly a drop-in replacement?

I doubt it, also can't think of any scanpy function that could use such representation out of the box. We'd have multiple use cases in squidpy though! Essentially being able to slice/subset/concatenate and copy anndata preserving the 0-axis of an akward array would cover most fo the cases I can think of right now.

@ivirshup
Copy link
Member

@giovp, for concat, AFAICT you can't have a multi-dimensional awkward array, so there won't be an "alternative" axis. Also, I don't think there is a logical way to concatenate an awkward array with any other kind of array, so it should probably just error. I think this could be handled similar to the all dataframe case.

Could you modify gen_adata in anndata.tests.helpers to add awkward arrays into obsm and varm? I think that'll help catch a lot of bugs.

@giovp
Copy link
Member Author

giovp commented Nov 15, 2021

@giovp, for concat, AFAICT you can't have a multi-dimensional awkward array, so there won't be an "alternative" axis. Also, I don't think there is a logical way to concatenate an awkward array with any other kind of array, so it should probably just error. I think this could be handled similar to the all dataframe case.

yeah you can't have multi-dimensonal akward array, but I think it would still be good to concatenate them across axis=0, and so this should be supported and hence escape the current alterante_axes check?

Could you modify gen_adata in anndata.tests.helpers to add awkward arrays into obsm and varm? I think that'll help catch a lot of bugs.

I will do that and run test locally, sorry for that

anndata/_core/anndata.py Outdated Show resolved Hide resolved
@ivirshup
Copy link
Member

yeah you can't have multi-dimensonal akward array, but I think it would still be good to concatenate them across axis=0, and so this should be supported and hence escape the current alterante_axes check?

Ahh yeah, this is what I meant. Basically have a case for all elements being awkward arrays.

@giovp
Copy link
Member Author

giovp commented Nov 15, 2021

Ok, at this stage concat works both for inner and outer on obsm, and subsetting works for varm.

Example
import numpy as np
import scanpy as sc
import squidpy as sq
import matplotlib.pyplot as plt
from numpy.random import default_rng
from sklearn.datasets import make_blobs
import awkward as ak
import pandas as pd
from cycler import cycler

sc.set_figure_params()

adata = sq.datasets.visium_hne_adata()
varm = adata.obsm["spatial"][15:30, :]
adata = adata[:10, :15].copy()
adata.obsm["spatial"] = (
    adata.obsm["spatial"] - np.std(adata.obsm["spatial"], 0)
) / np.mean(adata.obsm["spatial"], 0)

adata.varm["spatial"] = varm.copy()
adata.varm["spatial"] = (
    adata.varm["spatial"] - np.std(adata.varm["spatial"], 0)
) / np.mean(adata.varm["spatial"], 0)

obs_list = []
var_list = []
rng = default_rng(42)
for idx in adata.obs_names.values:
    coord, _ = make_blobs(
        n_samples=rng.integers(5, 15),
        cluster_std=0.02,
        centers=adata[idx].obsm["spatial"],
        random_state=42,
    )
    obs_list.append(coord)

for idx in adata.var_names.values:
    coord, _ = make_blobs(
        n_samples=rng.integers(5, 15),
        cluster_std=0.02,
        centers=adata[:, idx].varm["spatial"],
        random_state=42,
    )
    var_list.append(coord)

sub_obs = ak.Array(obs_list)
sub_var = ak.Array(var_list)

def plot_points(adata, main: np.ndarray, sub: ak.Array, axis: int, cmap_name):

    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    ax[1].set_prop_cycle(
        cycler("color", plt.get_cmap(cmap_name)(np.linspace(0, 1, len(main))))
    )
    ax[0].axis("equal")
    ax[0].scatter(
        x=main[:, 0], y=main[:, 1], c="grey", edgecolors="black", s=100, linewidths=1
    )
    for i in range(adata.shape[axis]):
        ax[1].scatter(
            x=sub[i, :, 0],
            y=sub[i, :, 1],
            edgecolors="black",
            alpha=0.7,
        )
        ax[1].axis("equal")

    return

plot_points(adata, adata.obsm["spatial"], sub_obs, 0, "winter")
plot_points(adata, adata.varm["spatial"], sub_var, 1, "cool")

adata.obsm

image

adata.varm

image

Slicing and copying also works

adata.obsm["sub_obs"] = sub_obs  # sub_obs is an awkward array
adata.varm["sub_var"] = sub_var
adata_subset = adata[:5, :7]  # let's subset

If I run test locally with pytest anndata/tests no tests fails, not sure, where shall I start looking?

@ivirshup
Copy link
Member

I think you just need to add awkward array to the testing dependencies

@ivirshup
Copy link
Member

Also, for gen_adata to return an object with awkward arrays in it, you've got to generate them and place them here:

https://github.com/theislab/anndata/blob/286bc7f207863964cb861f17c96ab24fe0cf72ac/anndata/tests/helpers.py#L111-L121

@giovp
Copy link
Member Author

giovp commented Nov 16, 2021

Also, for gen_adata to return an object with awkward arrays in it, you've got to generate them and place them here:

🤦 🤦 🤦

So I'm at stage where I could fix+add tests for awkward array for ages and am happy to do so but would like to get a sense of how welcomed this PR is. As it stands, we'd have to include awkward array as a (optional) dependency and it would be a non-negligible addition to the code base (although I expected worse, with singledispatch made it quite slim).

A major thing that could be impactful is that there is no .view() or .copy() notion in awkward array, see here and here.

Overall, I think we could use it right away in Squidpy, mostly as a way to enable anndata slicing and indexing while retaining sub-obs or sub-var info. We would not really use it to do any arithmetics or stuff like that, although it would be cool to come up with function ideas and also the fact that it support numba and jax jitting is cool.

So, to summarize, shall I go ahead? @ivirshup @hspitzer @Zethson @michalk8 ?

@ivirshup
Copy link
Member

I'm all for it. Having more record like data has been requested a number of times.

I would note that there are some things in the awkward array api that will change (scikit-hep/awkward#1151), but that's mostly down the line stuff.

A major thing that could be impactful is that there is no .view() or .copy() notion in awkward array

I believe it does have copy, just not as a method. We might be able to get them to add that?

@ivirshup ivirshup self-requested a review February 2, 2023 17:38
@ivirshup
Copy link
Member

ivirshup commented Feb 2, 2023

@grst tests are passing!

I think we need to open some issues on behavior we want to change, like the unions from outer concatenation. I would also like to take a look over the coverage, and see what we're missing.

How's the tutorial going?

@grst
Copy link
Contributor

grst commented Feb 2, 2023

How's the tutorial going?

I can finish that beginning of next week

anndata/_core/views.py Outdated Show resolved Hide resolved
@jpivarski
Copy link

Congratulations!!! This is not an ordinary PR. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants